Director of Technology Center 2100
Alexandria, VA 22313-1450
Sir: 

This is a response asserting that the outstanding Office Action mailed March 19, 2007 and
the finality of the same are improper. The outstanding Office Action has a period for response
set to expire on June 19, 2007.

Overview 

The Applicant submitted a Request for Continued Examination with the previous 
Amendment filed on January 3 T 2007. The Examiner indicated in the outstanding Office Action 
that the application is eligible for continued examination, and the finality of the previous Action 
was accordingly withdrawn pursuant to 37 C.F.R. § 1.114. In the Amendment, the Applicant
added new claim 20, which recited features somewhat similar to method claim 19, but written in
means-plus-function format. The Applicant particularly used means-plus-function language in
claim 20 requiring interpretation under 35 U.S.C. § 112 ¶ 6. Despite the fact that
means-plus-function claims must be analyzed differently than method claims per 35 U.S.C. § 112 ¶ 6, the
Examiner improperly made the current Office Action final. The Examiner has not indicated how
the features of means-plus-function claim 20 are being considered, whether interpretation under
35 U.S.C. § 112 ¶ 6 has been taken, or what corresponding parts of the specification
correlate to the recited "means" in claim 20.

Additionally, in the outstanding Office Action, the Examiner failed to consider references 
from the IDS of July 31, 2006 that were properly submitted to and received by the PTO. In the
current Office Action, the Examiner stated that "copies of the non-patent literature (items AM, AN 
and AO) were not submitted and thus not considered" (see page 2, item 4, of the Office Action). 



As indicated on the Return Postcards (see attached copy), the USPTO acknowledged receipt of 
the documents, which were included with the IDS. As such, the record is incomplete and entry 
of a final Action is improper. 

The Examination of Means-Plus-Function Claims Differs from Method Claims 

The Applicant respectfully submits that the outstanding Office Action is improper and the 
finality of the same is improper for the following reasons. A means-plus-function claim must be 
examined differently than method claims pursuant to 35 U.S.C. § 112 ¶ 6, which states "such
claim shall [emphasis added] be construed to cover the corresponding structure, material, or 
acts described in the specification and equivalents thereof." As such, a means-plus-function 
claim containing somewhat similar limitations to a method claim presents new issues for 
examination. 

Further, the Applicant respectfully points out that the present final Office Action does not 
meet the requirements of 37 CFR 1.113(b), which states "[i]n making such final rejection, the
examiner shall repeat or state all grounds of rejection then considered applicable to the claims in 
the application, clearly stating the reasons in support thereof." In the present Action, the 
Examiner repeated the rejections made in the method claim, without more. In fact, it is readily 
apparent from the Office Action that the Examiner merely cut his arguments with respect to claim 
19 and pasted them into the rejection of claim 20 (compare the rejection of claim 19 on pages 8
and 9 with the rejection of claim 20 on pages 9 and 10). There is no evidence in the record that
the Examiner ever analyzed claim 20 in the manner required by 35 U.S.C. § 112 ¶ 6.

As there were no means-plus-function claims in the claim set prior to the addition of claim 
20, it is clear that no such analysis was conducted for prior Amendments. Contrary to MPEP 
707.07(f), there is no clear explanation anywhere in the present Office Action that the Examiner 
has properly considered means-plus-function claim 20. According to 37 CFR 1.104(b), the
Examiner's answer must be complete as to all matters. Thus, the finality of the present Office
Action was improper. 

References not Considered with the July 31, 2006 IDS 

Additionally, references AM, AN and AO properly filed with the IDS were not considered. 
These references were properly filed and received by the USPTO, as indicated by the Return 
Postcard. Without consideration of these documents, the record is incomplete. Hence, the 
finality of the outstanding Office Action is further improper for this reason. 

For convenience, further copies of the references are attached hereto. 



Conclusion 

In light of the above, the outstanding Office Action and the finality thereof are improper. 
Therefore, it is respectfully requested that the finality of the Action be withdrawn. 

Respectfully submitted, 

STAAS & HALSEY LLP 
Speculative Execution and Reducing Branch Penalty
on a Superscalar Processor



Tetsuya HARA†, Nonmember and Masao NAKAYA†, Member



SUMMARY Superscalar processors improve performance
by exploiting instruction-level parallelism (ILP). ILP in a basic
block is, however, not sufficient on non-numerical applications
for gaining substantial speedup. Instructions across branches are
required to be executed in parallel to dramatically improve
performance. That is, speculative execution is strongly required.
Boosting is a general solution to achieving speculative execution.
Boosting labels an instruction to be speculatively executed, and
the hardware handles side-effects. This paper describes the
efficient implementation of boosting in terms of cost/performance
trade-offs. Our policy in implementation is beneficial in
code scheduling heuristics, penalties imposed by code duplication
to maintain program semantics, and area cost. This paper
also describes a branch scheme which minimizes branch penalty.
Branch delay causes crucial penalties on the performance of
superscalar processors since multiple delay slots exist even in a
single delay cycle. Our scheme is the fetching of both sequential
and target instructions, and either of them is selected on a
branch. No delay cycle is imposed. This scheme is realized
by a combination of static code movement and hardware support.
As a result, we reduce branch penalty with small cost.
Simulation results show that our ideas are highly effective in
improving the performance of a superscalar processor.
key words: superscalar, VLIW, speculative execution

1. Introduction 

Multiple instruction execution is a key aspect in
improving the performance of microprocessors. Most
state-of-the-art microprocessors have superscalar
architectures to exploit instruction-level parallelism
(ILP).

ILP is constrained primarily by two factors: data 
dependence and control dependence. Data dependence 
is classified into true dependence, anti-dependence, or 
output dependence. Instructions with true dependence 
must be executed sequentially. On the other hand, 
anti- or output dependence can be removed because 
they are artificially introduced. For example, register
renaming allows these instructions to be executed in parallel.
Unfortunately, true dependence is dominant. In particular,
the number of instructions in a basic block on
non-numerical applications is small (approximately
five), and most of them must be executed sequentially
due to true dependence.



Manuscript received December 25, 1992. 
Manuscript revised March 9, 1993. 
† The authors are with LSI Laboratory, Mitsubishi
Electric Corporation, Itami-shi, 664 Japan. 



Control dependence is imposed due to condi- 
tional branches. Since it is unknown whether instruc- 
tions in a basic block after a conditional branch will 
be executed, instructions in the basic block before the 
branch and instructions after the branch must be 
executed sequentially. 

ILP which can be exploited within basic blocks is
limited on non-numerical applications. It is only 1.5
under an ideal assumption which includes infinite
hardware. In fact, those current superscalar microprocessors
which exploit ILP only in a basic block
show small performance benefit over scalar processors
on non-numerical applications.

Limit studies show us the limit of ILP which
can be exploited under ideal assumptions (e.g. infinite
number of function units, one cycle operation, etc.).
Although these studies report different limits
(2-40) because of different assumptions, they indicate
that speculative execution is particularly important in 
gaining ILP benefits. Speculative execution is defined 
as the execution of instructions whose ultimate validity 
depends on the condition of a branch. Speculative 
execution, therefore, minimizes control dependence. 

Compiler techniques exist to realize speculative
execution. Compiler approaches have large scope
and good heuristics in instruction movement, and do
not have run-time overhead. Unfortunately, these
techniques basically move instructions across branches
only if the operation of the moved instructions neither
changes the program semantics nor causes an exception.
We term those operations safe and legal. A
speculative operation is said to be safe if that operation
cannot cause an exception to occur, and a speculative
operation is said to be legal if that operation does not
overwrite a location whose previous value is needed by
some other instructions when the program control
takes a path other than the one along which the
instruction was moved. Because of the limited capability
of speculative movement, these compiler techniques do
not yield substantial performance gain.

Loop-level optimization techniques, such as loop
unrolling and software pipelining, are also useful for
exploiting ILP. While these techniques are particularly
useful on numerical applications because innermost
loops consume a large part of CPU time, they are
Fig. 1 Hardware organization of SARCH.
Fig. 2 Pipeline.



tions sequentially. The choice of the number of
integer ALUs will be discussed in Sect. 5.

Figure 2 shows the pipeline — IF: instruction 
fetch, ID: instruction decode, hazard analysis, execu- 
tion of a branch instruction (including branch address 
calculation), and instruction issue, EXC: execution 
and data address calculation, MEM: data cache 
access, and WB: register write. The instruction set is
nearly identical to the MIPS R2000 RISC instruction
set (27). The latency of all integer instructions except
load and branch instructions is one cycle. Since data
is loaded from the data cache in the MEM stage, the
latency of a load instruction is two cycles. A branch
instruction is executed in the ID stage. Thus, the
latency of the branch instruction is two cycles.
SARCH uses neither a delayed load nor a simple
delayed branch as the R2000 does.

In general, the amount of ILP changes dynamically
at run time. Issuing a fixed number of instructions
to sustain peak ILP, as very long instruction word
(VLIW) machines do, yields an enormous number of
no-ops in the code. Therefore, dynamic hazard analysis
is required to keep code size comparable to that of
scalar machines. The hazard analyzer determines which
instructions can be issued, and then only instructions
without hazards are issued. In other words, the hazard
analyzer inserts no-ops dynamically.

Hazards can be caused by data dependence and
resource conflicts. Data dependence analysis is performed
by comparing register numbers and
checking a reservation table of the registers. The
comparison among destination register numbers and
source register numbers of candidates for parallel issue
is sufficient for the hazard analysis of instructions with
one cycle latency because all results before the WB
Fig. 3 Instruction group.
Fig. 4 Instruction queue.

stage are bypassed. On the other hand, those candidates
that will use data loaded from the data cache
should be blocked until the data becomes available
because the latency of a load instruction is two cycles.
A register reservation table is used for an availability
check of loaded data. The register reservation table
has 32 entries, each associated with a register number. An
issued load instruction marks the entry associated with
its destination register number, and a load instruction
which completes execution resets the entry associated
with its destination register number. By checking
the register reservation table, the hazard analyzer can
find the availability of source registers.
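
The data-hazard check described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not SARCH's actual logic: the names `Instr`, `ReservationTable`, and `issuable_prefix` are hypothetical, issue is assumed to stop at the first instruction with a hazard, and resource conflicts are not modeled.

```python
from typing import NamedTuple, Optional, Sequence

NUM_REGS = 32  # one reservation entry per register, as in the text


class Instr(NamedTuple):
    """Hypothetical instruction record: destination and source register numbers."""
    dest: Optional[int]
    srcs: Sequence[int]


class ReservationTable:
    """One busy flag per register, set while a load to that register is in flight."""

    def __init__(self) -> None:
        self.busy = [False] * NUM_REGS

    def issue_load(self, dest: int) -> None:
        self.busy[dest] = True   # marked when the load is issued

    def complete_load(self, dest: int) -> None:
        self.busy[dest] = False  # reset when the load completes

    def sources_ready(self, instr: Instr) -> bool:
        return not any(self.busy[r] for r in instr.srcs)


def issuable_prefix(window: Sequence[Instr], table: ReservationTable) -> int:
    """Return how many leading instructions of the window can issue together.

    Assumption: issue is in order and stops at the first hazard.
    """
    written = set()  # destinations of earlier instructions in this group
    count = 0
    for instr in window:
        if not table.sources_ready(instr):
            break  # a source register is still waiting for loaded data
        if any(src in written for src in instr.srcs):
            break  # true dependence on an instruction in the same group
        if instr.dest is not None:
            written.add(instr.dest)
        count += 1
    return count
```

For instance, with a load to register 5 outstanding, a window whose second instruction reads register 5 would issue only its first instruction in that cycle.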

SARCH fetches four instructions every cycle. We
call those instructions the instruction block. To pack
instructions, we should allow an instruction group,
that is, instructions which can be issued in parallel, to be
located across an instruction block boundary (Fig. 3).
To meet this requirement, we employ a dynamic
window. The window moves along the instruction
stream. The instructions in the window are supplied to
the hazard analyzer and do not depend on instruction
block boundaries. The window moves according to the
number of instructions issued every cycle.
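
As a rough illustration of how such a window moves independently of block boundaries, the sketch below tracks the window position cycle by cycle. The function names are hypothetical, and the window width of three is an assumption taken from the three-instruction window the text describes elsewhere.

```python
FETCH_BLOCK = 4  # instructions fetched per cycle, as in the text
WINDOW = 3       # assumed window width (three-instruction issue)


def window_positions(num_instructions, issued_per_cycle):
    """Return the (start, end) index range covered by the window in each cycle.

    issued_per_cycle lists how many instructions the hazard analyzer
    lets through in each cycle; the window pointer advances by that amount.
    """
    wp = 0  # window pointer: index of the first instruction in the window
    spans = []
    for issued in issued_per_cycle:
        if wp >= num_instructions:
            break
        end = min(wp + WINDOW, num_instructions)
        spans.append((wp, end))
        wp += min(issued, end - wp)
    return spans


def crosses_block_boundary(start, end):
    """True if the window [start, end) spans two fetched instruction blocks."""
    return start // FETCH_BLOCK != (end - 1) // FETCH_BLOCK
```

With issue counts of 2, 3, 1, and 3, the window covers instructions 0-2, then 2-4, then 5-7, then 6-8; the second and fourth windows straddle a block boundary, which is the case the dynamic window is meant to allow.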

Figure 4 shows the implementation of the
dynamic window. The dynamic window is realized by
the instruction queue. The queue has eight latches
which store instructions. Four fetched instructions are
written in latches 0-3 or 4-7 every cycle. Each
latch is connected to three bit lines through a switch
box. The switch box is controlled by the window
pointer indicating the top entry of the window. Data
in three successive latches indicated by the window
pointer is read out to the bit lines. For example, if the
window pointer is 2, data in latches 2, 3, and 4 are read
out. The movement of the window is performed by
Fig. 7 Code duplication on boosting limited in a trace.



transferred to D from A (409S), a code movement from 
C to A seems to create better code. We should,
however, consider the penalty imposed on off-trace
paths. Since C is a join block, that is, C has a
predecessor (block E) other than B, code duplication
to E is required to maintain semantics when a code
movement from C to A is performed. This duplication
may impose a penalty on the off-trace path. If B is also
a join block, the scheduler needs code duplication
again for the predecessors of B. In general, the amount
of duplicated code increases with the number of conditional
branches that are moved across. Thus, a code
movement which relies on static branch prediction
does not necessarily create the best code in terms of
performance.

The capability of code movement along an unlikely
path is also effective in suppressing penalties imposed
on unlikely paths when code duplication is necessary.
The boosting model which allows code movement only
in a trace has a constraint on code duplication. Consider
the case in Fig. 6 again. When a code movement
from C to A is performed, the code should be duplicated
to E. Since this code duplication is speculative,
boosting may be required (notice that labeling of
boosting is required only in the case of unsafe or illegal
movement). In the case in Fig. 6, however, since the
path from E to C is not a probable path, and consequently
is not a trace, code movement from C to E is
not allowed if it is an unsafe or illegal movement.

The scheduler that relies on trace paths has three
options in this case. The first option is to disallow
those movements where duplication is unsafe or illegal.
Although this option is easily implemented,
it imposes great constraints on scheduling,
and consequently degrades the performance dramatically.
The second option is to change the entry point
of the branch in B into the point of the next instruction
to the moved instruction if C is not a fall-through
block of B (Fig. 7(b)). This is effective, but it cannot
be performed if C is a fall-through block of B. The
third option is to dynamically make a new basic block
(block C') which contains duplicated code (Fig. 7(c)).
This option keeps the semantics, but may cause
a penalty on the off-trace path (E to C through C').
In contrast, the scheduler in two-way boosting has
Fig. 8 Code duplication on two-way boosting.



no constraints on code duplication. Two-way boosting
allows the duplicated instruction to be moved into
E in all cases. Figure 8 shows the duplication
in the case of Fig. 7(c). If the scheduler
moves i1 into B, i1 is labeled i1.n, and duplicated code
is placed in E labeled i1.t, where the label .n means
that an instruction labeled .n is valid if the associated
branch is not taken, and similarly the label .t means
that an instruction labeled .t is valid if the associated
branch is taken. The duplicated instructions will
probably be issued along with instructions which belong
to the original block. Therefore, penalties imposed by
code duplication can be minimized. In our code
scheduler, the second option described above and
two-way boosting are incorporated to produce the best
schedule.
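
The meaning of the .n and .t labels can be sketched as a small filter applied once the associated branch resolves. This is a simplified model and not the SARCH squashing logic: the names `Boosted` and `surviving` are hypothetical, and for brevity both labeled copies are placed in one list even though the text puts them in different blocks.

```python
from typing import List, NamedTuple, Optional


class Boosted(NamedTuple):
    """Hypothetical record for a (possibly boosted) instruction."""
    name: str
    label: Optional[str]  # None: not boosted, "n": valid if not taken, "t": valid if taken


def surviving(instrs: List[Boosted], branch_taken: bool) -> List[str]:
    """Return the names of instructions whose results stay valid.

    Unlabeled instructions are always valid; a boosted instruction is
    kept only if its label matches the branch outcome, else it is squashed.
    """
    keep = "t" if branch_taken else "n"
    return [i.name for i in instrs if i.label is None or i.label == keep]
```

In this model, an instruction labeled .n is squashed when the branch is taken and one labeled .t is squashed when it is not, so exactly one copy of a duplicated instruction takes effect on either path.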

Another advantage of two-way boosting is removing
unconditional branch instructions for a short
branch-and-join. Consider the example shown in Fig.
9(a). As shown in the control flow graph in Fig. 9(b),
the block A branches to B and C, but they are joined
into D with a short run. The MIPS R2000 optimized
code and the SARCH optimized code are shown in Fig.
9(c) and (d), respectively. Notice that the unconditional
branch instruction (jump EXIT) is removed.
This optimization is similar to one in the guarded
instruction model (20). A guarded instruction is conditionally
executed depending on a value in the register
designated in the code. Two-way boosting, however,
has an advantage over the guarded instruction model
since it does not need predicate dependence. That is,
in the guarded instruction model, B and C can be
issued in parallel, but A, B and C cannot be executed
in parallel since the execution of B and C should wait
until the branch condition is determined in A.
If a branch instruction is executed and the branch
is taken, target instructions which were statically
moved are executed in the next cycle. Because of the
sequential placement of the target instructions, no
branch delay is imposed. At the same time, a new
target address (not the original target address, but the
newly created target address after the movement of the
target instructions) is output to the instruction cache.
The dynamically fetched target instructions are
appended below the statically moved target instructions
in the instruction queue. Since these instructions
can be executed immediately, the pipeline does not
stall. Figure 11 shows the image of the instruction
queue after fetch of the new target instructions.

Fig. 11 Image of the instruction queue after fetch of the new
target instructions. (Figure labels: instruction queue, instruction
block, branch, statically moved target instructions, dynamically
fetched target instructions.)

Figure 12 is an example explaining the branch
scheme. Figure 12(a) is an example code in memory
where i2-i4 are instructions to be executed when the
branch is untaken, and t1-t9 are instructions to be
executed when the branch is taken. Figure 12(b) is the
branch taken case. The instruction group is a group of
instructions which are issued in parallel. This grouping
is done by the hazard analyzer described in Sect. 2.
Recall that the instructions in the window
(three-instruction wide) pointed to by the window pointer
(denoted by wp in the figure) are read out to the
hazard analyzer. In the cycle n, the instructions (i1,
branch, t1) are issued due to no hazard in this example.
Because the branch is taken, t1 is issued along
with i1 and branch. In the next cycle, the cycle n+1,
(t2, t3, t4) are issued without delay because of the
static movement of the target instructions. At the same
time, the target PC (address of LAB) is sent to the
instruction cache. In the next cycle, the cycle n+2, (t5,
t6, t7) can be issued immediately because t6-t9 are
loaded to the queue in this cycle.†

Figure 12(c) shows the case of the untaken branch
without delay. In the cycle n, (i1, branch, t1) is read
out from the instruction queue, and only (i1, branch)
is issued. The instruction t1 is nullified in this cycle
because it is the target instruction. In the next cycle,
the window pointer jumps and is set to 8 so that (i2, i3,
i4) can be read from the instruction queue. These
instructions are issued in the cycle n+1 without delay.

† The figure shows (t5, t6, t7) being fetched in the same
cycle, but to be precise, t5 is fetched in the cycle n due to
instruction prefetching, while (t6, t7) are fetched in the
cycle n+1.
(Fig. 12(d)). The average issue rate is between two
and three, and thus the possibility of the pipeline stall
due to the untaken branch is reduced. Notice that the
bandwidth of the instruction fetch (four instructions
per cycle) is determined to meet the requirement for
sustaining the peak issue rate (three instructions per
cycle), not to meet the requirement for the branch
scheme. That is, we pay no cost in instruction fetch
bandwidth for implementing the branch scheme.

4.2 Comparison with a Previous Branch Scheme
Prefetching from Both Directions 

Prefetching from both directions is employed in 
mainframes (e.g. IBM370/168). In these machines, the 
condition code (CC) which is needed for the branch is 
determined in a late stage. Because of late CC setting, 
as well as higher fetch rate than the issue rate, the 
machine can afford to fetch instructions from both 
directions. There are two buffers to store instructions 
for the sequential and target path before an instruction 
register. When an instruction decoder following the 
instruction register finds a branch, the target address, if 
operands for target address calculation are available, is 
sent to the instruction cache, and the target is fetched 
in the target buffer. When the branch is resolved, either
the sequential or target instruction from the buffers is
selected according to the result.

Our scheme is similar to the scheme above, but is
efficiently realized in the RISC-type superscalar
machine. Before describing the primary benefits, we note
a secondary benefit: our scheme does not send the
target address until the branch is found to be taken. In
the mainframe schemes, the target address is sent to the
instruction cache before the target is known to be
needed. Since the target address is not, in general,
sequential to the previous instruction address,
this extra instruction reference does not take advantage
of spatial locality. Although LSI technology has
advanced, the primary cache which can be integrated
in a single chip is limited (currently from 8 KB (3) to 20
KB (1)). Therefore, the extra instruction reference might
cause a cache (and TLB) miss in a small processor,
unlike in mainframes. The handling of these misses
should be suppressed because it is unknown whether
the processor should really handle them or not. This
control might be realized, but makes the hardware
complicated.

Ignoring the secondary benefit, it is useless to
employ the mainframe branch scheme as it is
because CC is determined early in RISC machines.
RISC machines do not need memory operands for an
ALU operation; ALU operations are performed
only between register operands. Therefore, a RISC
machine has neither an operand-address calculation
stage nor a memory-operand fetch stage before an
execution stage, which are required for mainframe
instruction-set architecture. That is, the execution
stage (or EXC in SARCH, see Fig. 2) is placed immediately
after the register-operand fetch stage (or ID in
SARCH). Therefore, CC is available one cycle after
the issue of the CC-set instruction. Furthermore,
compare-and-branch, which is employed in SARCH,
does not need CC setting since the equivalent operation
is performed in the ID stage together with CC
testing and branch address calculation. In other words,
the branch instruction can be executed immediately
after it is fetched, while in mainframes the execution is
suspended until CC is set by the previous instruction.
Therefore, the branch scheme where the target
address is sent if the decoder finds the branch instruction
is useless in RISC machines.

A revised version of the mainframe branch scheme
is to decode instructions early, before they
are decoded in the instruction decoder. This predecoding
makes it possible to send the target address and fetch the
target before the branch execution, but requires registers
to hold the fetched instruction block other than the
buffers and a decoder to find the branch instructions;
the original instruction decoders are still needed to
determine the branch direction. This duplication costs
approximately 3K transistors ((400 transistors (registers)
+ 300 transistors (decoder)) × 4 (instructions per
block) + 250 transistors (selector to share the branch
address generator)). Unlike this revised scheme, our
scheme does not need the hardware for predecoding.
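
The transistor estimate quoted above can be checked with a few lines of arithmetic. The constant and function names below are illustrative only; the figures are the ones given in the text.

```python
REGISTER_TRANSISTORS = 400    # per instruction, registers holding the fetched block
DECODER_TRANSISTORS = 300     # per instruction, decoder that finds branches
INSTRUCTIONS_PER_BLOCK = 4
SELECTOR_TRANSISTORS = 250    # selector sharing the branch address generator


def predecode_cost() -> int:
    """Total extra transistors for the predecoding hardware."""
    per_instruction = REGISTER_TRANSISTORS + DECODER_TRANSISTORS
    return per_instruction * INSTRUCTIONS_PER_BLOCK + SELECTOR_TRANSISTORS
```

The sum comes to 3,050 transistors, which matches the "approximately 3K" stated in the text.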

Finally, we should consider the penalty when the
prefetch from both directions is not successfully performed.
In SARCH code, basic block length, measured in
cycles consumed from the entrance to the exit, is
extremely short. According to dynamic statistics we
collected from benchmark programs (see Table I), 37%
of the basic blocks are just a single cycle in length; 49%
are less than two cycles in length. Under this situation,
which can be exploited in a basic block is strictly
limited. As mentioned in Sect. 1, most instructions in
non-numerical applications are dependent with true
dependence. Therefore, there is little ILP in the basic
block. Furthermore, basic block scheduling does not
solve the problem of control dependence. As a result,
the performance is not improved significantly by basic
block scheduling.

Figure 15 shows performance improvement when
two-way boosting and our branch scheme are
introduced. The average speedup is 1.36x in the
machine with two ALUs, and is Mix in the machine
with three ALUs. These improvements
indicate that up to three ALUs are beneficial in our
scheduling scheme. The machine with four ALUs,
however, achieves only a 1.2% performance benefit over
the machine with three ALUs. The primary reason for
the performance limit is that the load/store instructions
must be executed sequentially, while the ALU
Fig. 14 Performance improvement with basic block scheduling
only.
Fig. 15 Performance improvement with two-way boosting and
our branch scheme.



instructions can be executed in parallel. Therefore, the 
parallel execution of load/store instructions is the next 
most important aspect to improving performance. 

Figure 16 summarizes the performance benefits
which are obtained by two-way boosting and our
branch scheme in the three-issue machine. In model A,
only basic block scheduling is performed. In model B,
the scheduler schedules code with two-way boosting,
but our branch scheme is not introduced. Model C
employs both two-way boosting and our branch
scheme. The figure shows that two-way boosting
achieves a 21.4% performance improvement over basic
block scheduling, and our branch scheme achieves a
further 8.1% performance improvement. A total
performance improvement of 29.5% is achieved. Thus
our ideas are highly effective in performance
improvement.
An interesting question is how our scheme compares with
the boosting which schedules instructions across multiple
branches in terms of performance and hardware
amount. This type of boosting is supported with
shadow structures for speculative state buffering.
From the viewpoint of the hardware amount, it is
obvious that our boosting needs a smaller amount of
hardware because shadow structures are not needed.
For example, the number of transistors in a six-read,
three-write register file, which is required for
three-issue SARCH, is 23 K transistors†, and thus an extra 23 K
transistors are required for a shadow register file. The only
increase over one-level original boosting is extra
squashing logic in the pipelines, which is trivial. Yet,
this increase is equivalent to or smaller than that of two- or
more-level original boosting.

To compare performance with the original boost- 
Fig. 16 Performance impact of two-way boosting and our
branch scheme. 



† The number of transistors is quite significant because it
is 27% more than the number of transistors in a single
integer unit in SARCH.
substantial speedup gain.
In 1987, as the first Sun 4 workstation moved into full production, an
emerging emitter-coupled logic technology caught Sun Microsystems'
attention. ECL promised very large scale integrated densities at much
higher operating frequencies than CMOS (complementary metal-oxide
semiconductor) technology. The convergence of this technology with our
Scalable Processor Architecture (Sparc) would for the first time enable a
complete microprocessor to be implemented in ECL.

Sun initiated a joint development project with Bipolar Integrated Technology
to implement Sparc using BIT's new bipolar process. Initial work
indicated that a clock cycle time of 12.5 nanoseconds, an execution rate of
I..? clock cycles per instruction, and benchmark performance of 60 million
instructions per second and 12 million floating-point operations per second
were achievable with an entry-level price of $100,000.
This was an aggressive goal considering that mini-supercomputers then
under development targeted clock rates of 25 to 40 megahertz and $200,000
to ^00.000 entry-level prices. We were confident that the simplicity of
Sparc would put the processor core on less than a dozen chips, and Sun's
workstation heritage would fit the entire 80-MHz processor onto one circuit
card. We minimized the cost by using air cooling, pin grid array and dual
in-line packaging, 10K technology, and conventional printed circuit boards.
We believe that early adoption of new technology is the key to creating
competitive products. However, chip development can no longer be separated
from system design since much of the system now resides on the
con-htifif " lh '' n r fa r° rS - i V mind ' WC aSSSmb,ed a «S 'that 
JIT u- ? n '- mber ° f IC en S ineers and system or board-level 

designers. We knew that board-level issues such as RAM access characteristics,
transmission line design, and clock skew control as well as system
architecture would determine many of the requirements for the VLSI chips.
Here we briefly review both ECL technology and the features of BIT's

the cOt^T* and * SCUS . S h ™ boanJ and cacf * considerations influenced 
the chip designs. Discussion of the integer unit pipeline, system interface
signals, and coprocessor interface concludes the article. The chip set, now
R°ST. $ commercially from BIT as the B5000 series See The 
Microprocessor box. 



Technology 

ECL is a digital bipolar technology generally used for applications in
which power dissipation is less important than switching speed.
The bipolar transistors of a traditional ECL design resulted
in a lower integration density than their CMOS counterparts. In
addition, power dissipation was high because gates had to be
biased to drive longer, more capacitive internal signal lines.

0272-1732/90/0200-0010$01.00 © 1990 IEEE
Figure 1. An ECL inverter (a), a two-level series gate implementing the function (A + B)(C + D) and its complement (b), and a three-level series gate implementing the function (A + B)(C + D)(E + F) (c).



BIT achieved a double breakthrough when it

• reduced the physical size of the bipolar transistors
and

• provided a typical unloaded gate propagation delay
of 375 picoseconds, while biasing each gate with 70
microamperes of static current. More traditional ECL
techniques use 200 µA to 1 mA to achieve comparable
switching speed.

With three layers of interconnect metallization, the
process is ideally suited for building VLSI devices.¹

All three layers of metal distribute power on these
devices to minimize the voltage drops along the bus and
to avoid metal migration. A package with an embedded
copper-tungsten slug having high thermal conductivity
dissipates the power. The die bonds directly to the slug,
which transfers the heat to the top of the package. A
heat sink in a forced airstream dissipates the heat. The
resulting thermal resistance (θ) is about 2.5 degrees
centigrade per watt.

An ECL inverter consists of a differential pair, a
current source, a load resistor, and an output driver
(Figure 1a). By adding transistors in parallel to the
input transistor of the differential pair, we create an Or
function. If we add another differential pair between
the original pair and the load resistor, we create an And
function.

This stacking of differential pairs is called series-gating.
A traditional ECL process allows only two levels of
series-gating. For example, the logic function
(A + B)(C + D) can be implemented as one gate (Figure
1b). The BIT process allows three levels of series-gating,
which supports functions such as (A + B)(C + D)(E + F),
as shown in Figure 1c. The penalty for the additional
functionality is an increase in the propagation delay;
however, this increase is generally less than if the
function were decomposed into two gates. Three
levels of series-gating proved useful in constructing the
shift registers that make up the diagnostic scan chain in
the integer unit, or IU, and the floating-point controller,
or FPC. The required multiplexer and latch functions
combine into one gate.
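
As a purely logical illustration (not a circuit model), series-gating can be read as an Or within each stacked differential-pair level and an And across the levels. The sketch below assumes that reading; the helper names `series_gate` and `three_level` are hypothetical and do not come from the article.

```python
from itertools import product

def series_gate(levels):
    """Logical model of a series-gated ECL gate.

    `levels` holds one group of inputs per stacked differential pair.
    Inputs inside a level are Or-ed (parallel input transistors) and the
    levels are And-ed (stacked pairs). Returns (output, complement),
    since an ECL gate provides both polarities.
    """
    out = all(any(level) for level in levels)
    return out, not out

def three_level(a, b, c, d, e, f):
    """Three-level gate: (A + B)(C + D)(E + F)."""
    return series_gate([(a, b), (c, d), (e, f)])[0]

# Exhaustively compare the model with the Boolean expression.
for a, b, c, d, e, f in product([False, True], repeat=6):
    assert three_level(a, b, c, d, e, f) == ((a or b) and (c or d) and (e or f))
```

A two-level gate such as (A + B)(C + D) is the same call with two input groups.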

ECL circuits can efficiently drive transmission lines
at high frequencies, so the IU and the FPC each contain
a set of low-impedance (25-ohm) output drivers to
drive system buses directly. We minimized the timing
skew between these outputs by matching delays on the
chip. We also designed package traces from the die to
the pin to match the impedance of the drivers.

BIT technology supports two ECL interface standards,
designated 10K and 100K. The 100K interface
uses temperature compensation circuitry to minimize
the temperature-induced shift in the switching
threshold. The 10K interface specifies the amount of
threshold shift that is allowed across the operating
temperature range. All 100K circuits operate with a
-4.5 volt supply, while 10K circuits operate at -5.2
volts. We chose to adhere to the 10K standard because
it offers a wider selection of standard components. The
cooling system maintains a maximum junction-temperature
gradient of 25°C across the board to
minimize differences in switching thresholds between
devices. This system is necessary to maintain noise
immunity and to control temperature-induced
switching skew.



Board design influences 

We started with several studies of cache designs. 
Although we looked at a variety of concepts, the 
simplest design offered the shortest access time, and 
the more complex designs did not offer comparable 
advantages. 
several basic decisions we made to minimize signal
propagation delays:

• only the IU drives the cache address bus;

• the cache data bus only connects to the IU and the
floating-point unit, or FPU; and

• the cache write operation does not limit the cycle
time: it is preferable to use two cycles for each write.

As a result, all data entering the IU or FPU comes
from the cache RAMs, and cache data must pass through
the IU or the FPU to get to memory. Noncached data
(such as system control registers, or power-on bootstrap
code) moves through the cache. First, the cache
controller saves the addressed word from the cache in
a temporary register. It then writes the desired data into
the vacated cache location, from where the data passes
into the IU or FPU, and finally restores the original
cache data from the temporary register.
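
The save, write, deliver, and restore sequence for noncached data can be sketched as follows. This is a toy model under stated assumptions: the class `CacheModel`, the method `read_noncached`, and the callback `fetch_noncached_word` are hypothetical names, and the cache is reduced to a flat list of words.

```python
class CacheModel:
    """Toy cache whose RAM is the only data source for the IU or FPU."""

    def __init__(self, words):
        self.ram = list(words)  # cache RAM contents, one word per index

    def read_noncached(self, index, fetch_noncached_word):
        """Deliver a noncached word to the processor through the cache."""
        temp = self.ram[index]                    # 1. save the addressed cache word
        self.ram[index] = fetch_noncached_word()  # 2. write noncached data into the vacated location
        delivered = self.ram[index]               # 3. the IU or FPU reads it from the cache RAM
        self.ram[index] = temp                    # 4. restore the original cache data
        return delivered


cache = CacheModel([10, 20, 30])
boot_word = cache.read_noncached(1, lambda: 99)  # hypothetical bootstrap word
```

After the call the cache contents are unchanged, which is the point of the temporary register.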

The CPU core seen in Figure 2 contains the Sparc
integer unit, the five-chip FPU, the cache RAM array,
the tag match chip, and four copies of the system data
path gate array. The FPU consists of the controller, two
register file chips, the double-precision ALU, and the
double-precision multiplier. The tag match chip implements
the cache miss logic and the tag portion of a four-entry
translation lookaside buffer cache, or TLB. The
system data path contains the data portion of the TLB
and provides interconnections between the major units
of the CPU via the CPU bus. It also contains the
memory bus interface.



Cache design

With each address pin of a RAM presenting 5 to 7
picofarads of loading in DIP form, the AC impedance
of an address bus with the RAMs packed as closely as
possible is about 25 ohms. With 25 RAMs (18 for data
and parity, and 7 for tags), the delay on the address
transmission line totals about 10 ns. By splitting the
cache into two banks, each driven by its own copy of the
address, we halve the delay time to 5 ns. We provided
the IU with 25-ohm differential drivers for 16 address
lines, enough to support a 512-Kbyte cache. A further
split into four banks is possible by using external
drivers, but these can add significant delay and skew.
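
The figures in this paragraph can be cross-checked with a little arithmetic. The sketch below assumes that each cache address selects one 64-bit data word (the width of the external buses mentioned elsewhere in the article) and that address-line delay scales with the number of RAM loads on each copy of the address; both are assumptions for illustration, and the function names are hypothetical.

```python
ADDRESS_LINES = 16        # address lines driven by the IU (from the text)
DATA_WORD_BITS = 64       # assumed width of one cache data word
UNSPLIT_DELAY_NS = 10.0   # address-line delay with all RAMs on one line (from the text)

def cache_bytes(address_lines, data_word_bits):
    """Capacity addressed by `address_lines` lines of `data_word_bits`-bit words."""
    return (2 ** address_lines) * (data_word_bits // 8)

def bank_delay_ns(unsplit_delay_ns, banks):
    """Address-line delay per bank, assuming delay scales with RAM loads per line."""
    return unsplit_delay_ns / banks

capacity_kbytes = cache_bytes(ADDRESS_LINES, DATA_WORD_BITS) // 1024  # 512
two_bank_delay = bank_delay_ns(UNSPLIT_DELAY_NS, 2)                   # 5.0
```

The same scaling would suggest 2.5 ns for four banks, but as the text notes, the external drivers that such a split requires add delay and skew of their own, which this simple model ignores.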

We optimized the tag-access and compare operations
for speed in several ways. First, the tag RAMs are four
times smaller than the data RAMs and thus have a
shorter access time. Second, we designed a high-speed
ECL gate array (the tag match chip) to perform the tag
match computation. Third, we placed the tag RAMs so
that they are the first to receive the address from the IU.

Systems with one-cycle access caches have in the
past required RAMs with a read-access delay significantly
shorter than the cycle time of the machine. We
assumed that such RAMs would be unavailable, too
expensive, or too small for this machine, so we designed
the pipeline to allow extra time in the cache
access stage. We "borrow" time from the address generation
stage (see Figure 3), which operates in less than
7.5 ns. An on-chip cache address register (CAR) is
clocked with an early clock to enable the next cache
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dress changes, and the store data arrives one cycle later 
than the address. There must also be time to stop the
write pulse from occurring if the tag check results in a
cache miss. If the store instruction is followed by a
nonmemory-access instruction, its execution will be
overlapped with the third cycle of the store, reducing
the effective cost of the store to two cycles.

The timing of store and swap (seen in Figure 4a)
illustrates the critical placement of the write pulse
between the arrival of the data and the end of the
address. Integer load-double (Figure 4b) takes two
separate load cycles. Integer store-double (Figure 4c) is
the only three-cycle instruction. Floating-point store
and store-double instructions take one cycle in the
pipeline, although it takes three cycles to complete the
cache write (Figure 4d).



IU pipeline 



Sparc offers a simple instruction set based on a
register-to-register paradigm. Only load and store instructions
access memory, and the only addressing
modes used by these instructions are register-plus-register
and register-plus-immediate. Register specifiers
appear in the same bit fields of every instruction.
Branch instructions carry an immediate, program
counter-relative offset.⁵

The simplicity of Sparc allows all of the instructions
to fit neatly into a fixed pipeline which, on the B5000,
consists of five stages: fetch (F), read (R), execute (E),
memory (M), and write (W). (Note in The Sparc Architecture
box that pipelining is not a required feature of
any Sparc implementation.) Each stage completes its
processing in one clock cycle. During the F stage,
instructions move from memory into the processor.
Then the processor reads operands from the register
file, decodes opcodes, and detects instruction dependencies
in the R stage. The operands enter either the
arithmetic and logic unit (ALU) or the shift unit in the
E stage. Load and store instructions in the M stage fetch
data operands from memory, and arithmetic instructions
use it to move the arithmetic result back across the
chip to the write port of the register file. In the W stage
the processor writes the ALU result or the data from
memory into the register file.

We use standard result-forwarding techniques to keep
the pipeline full, even when an instruction's result is
used in the next clock cycle. We added a new
data path to this implementation to allow load
instructions to execute in one cycle, or two cycles when
the next instruction requires the data being loaded. The
hardware detects and handles the two-cycle case;
compiler support is not required. We call this an
interlock action and in general use it when an instruction
encounters a resource conflict or data dependency and
cannot be issued in the current cycle.
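
The load interlock can be illustrated with a small cycle counter. This is a sketch under simplifying assumptions: it models only the load-use case described above (a load costs one extra cycle when the very next instruction reads the loaded register) and ignores stores and other multiple-cycle instructions. The `Instr` record and its register names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Instr:
    op: str                      # "load" or any other mnemonic (hypothetical encoding)
    dest: Optional[str] = None   # destination register, if any
    srcs: Tuple[str, ...] = ()   # source registers

def issue_cycles(program):
    """Count issue cycles for a straight-line instruction sequence.

    Every instruction takes one cycle; one interlock cycle is added when an
    instruction reads the register loaded by the instruction just before it.
    """
    cycles = 0
    prev = None
    for ins in program:
        cycles += 1
        if prev is not None and prev.op == "load" and prev.dest in ins.srcs:
            cycles += 1  # hardware interlock: the load effectively takes two cycles
        prev = ins
    return cycles

dependent = [Instr("load", "r1", ("r2",)), Instr("add", "r3", ("r1", "r4"))]
independent = [Instr("load", "r1", ("r2",)), Instr("add", "r3", ("r5", "r4"))]
```

Here `issue_cycles(dependent)` counts three cycles and `issue_cycles(independent)` counts two, matching the one-cycle and two-cycle load cases in the text.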

Figure 4 illustrates most of the multiple-cycle instructions.
The integer store and store-double instructions
require two and three cycles respectively because
the IU register file does not have a third read port to
access the stored data in the R stage. However, the
floating-point load, load-double, store, and store-double
instructions take just one cycle in the pipeline
due to external 64-bit buses and the separate
floating-point register file.



Instruction queue 

The B5000 contains a 64-bit data input bus on which 
two instructions can be fetched in parallel. One or both 
of these instructions will be inserted into a queue 
which has a maximum depth of four instructions. The 
queue allows the pipeline to complete one instruction 
every cycle even when memory access instructions 
occasionally use the data bus (Figure 5). The B stage of 




The Sparc Architecture 



The reduced instruction-set computer, or RISC,
architectures developed at the University of California
at Berkeley form the basis of the Scalable Processor
Architecture. Sparc contains only 32-bit-long
instructions in three formats. Operand specifiers
appear in fixed positions in the instructions to enable
rapid register file access. Delayed control transfer
instructions allow an instruction that follows a control
transfer instruction to execute before the transfer
of control occurs.



In Figure B we can see that Format 1 is used only by
the subroutine Call instruction. This format has a 30-bit
displacement, which allows a Call to any word-aligned
address in the virtual address space. Format 2 supports
the Sethi instruction and the Branch instructions. The
displacement allows Branch instructions to span
16 Mbytes of the virtual address space. Format 3 supports
all of the other instructions. It has three 5-bit
operand specifiers, or two 5-bit operand specifiers and
one 13-bit signed immediate constant. The three-
Table A. Three of the four general [...] Sparc instruction set.
[Table layout lost in extraction; recoverable entries are grouped by column.]

Columns: Instruction, Data types, Data width, Variations, Notes

Memory operations
  Instructions: Load, Store, Swap, Load/Store
  Data types:   Signed, Unsigned
  Data width:   Byte, halfword; Word; Double word; Single, double
  Variations:   Alternate space; Floating-point queue;
                Floating-point status reg.
  Notes:        Atomic

Integer computational
  Instructions: And, Or, XOR, Add, Sub, MULSCC, Save, Restore, Shift,
                Read, Write
  Data types:   Signed
  Data width:   Word
  Variations:   Not; Set cc; Extended; Left, right; PSR, WIM, TBR, Y
  Notes:        Set cc; Performs Add; Shift by 0 to 31 bits

Control transfer
  Instructions: Branch, Call, Jump and Link, Return from Trap
  Data width:   Word
  Variations:   22-bit displacement; 30-bit displacement; Integer cc;
                Floating-point cc; Coprocessor cc
  Notes:        Delayed; Privileged, delayed

Footnotes (partly illegible): PSR is the processor state register;
WIM is the window invalid mask; TBR is the trap base register;
Y is the Y register (used by MULSCC).

Table B. Data types.

  Byte          8 bits
  Halfword      16 bits
  Word          32 bits
  Double word   64 bits
  Single        32-bit floating-point value
  Double        64-bit floating-point value

In Tables A and B we list [...] floating-point
While pipelining is not a part of Sparc, the instruction
set design allows implementations to overlap the
execution of several instructions at once.
Figure 6. Branches: taken (a) and not taken (b).

fetched. The branch resided at an even-word address,
so its delay instruction was also fetched at the same
time. Complexities arise when the fall-through or delay
instructions are not in the instruction queue. These
infrequent cases do not significantly affect performance,
although they do complicate the logic design.

The instruction at the target address of a branch is
always fetched in the E stage of the branch (Figure 6).
In the subsequent cycle 3, when the compare (Cmp)
instruction computes and makes known the condition
codes, the target instruction is either decoded immediately or
discarded. If the branch is not taken, the prefetched
instructions following the branch/delay pair (the fall-through
instructions) execute, so no cycles are lost.

FPU, coprocessor interfaces 

The B5000 implements the Sparc floating-point interface
and coprocessor interface symmetrically, so
discussion of the floating-point interface applies
equally to the coprocessor. Sparc allows register-to-register
floating-point operations to execute and complete
in the "background" while the pipeline continues
to execute integer instructions. However, a floating-point
operation may complete by generating an exception,
rather than an arithmetic result.

To pinpoint the instruction causing an exception, the
FPU maintains a queue of pending instructions and
their addresses. The queue can be read after an exception
occurs to determine which instruction actually
caused the exception and which subsequent instructions
had been issued but not completed. Each entry of
the queue contains an instruction and its address. The
IU dispatches instructions on the store-data bus in the
pipeline. The address is generated on
the FPC chip, which has a copy of the E stage program
counter (called the XPC). The IU manages the XPC
with its "increment XPC" control, and by loading it
from the store-data bus following any control transfer.
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Reducing the Branch Penalty 
by Rearranging Instructions in a Double-Width Memory 
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Computer Science Institute, FORTH 
Heraklio, Crete, Greece



ABSTRACT: In a pipelined processor with an instruction-fetch
throughput of two (consecutive) instructions per cycle, one
method to reduce the branch penalty is to rearrange the code by
placing (copies of) instructions from both targets of a branch in
the double-width fetch stream after that branch. This scheme is of
interest e.g. when the number of fetch cycles is large, thus making
it hard to fill all the delay slots with instructions from before
the branch, and when the hardware has super-scalar capabilities
but the compiler does not find enough instructions for parallel
execution in the basic block where a branch is predicted to go.
We study this scheme of rearranging instructions, and we evaluate
its performance (execution time and code size) in the case where
no parallel instructions are scheduled in the delay slots.

KEYWORDS: Pipelined computer architecture, Branch penalty,
Delayed branch, Delay slot, Rearranging instructions into delay
slots, Super-scalar computer architecture, Double-width instruction
memory.



1. INTRODUCTION 

Branch and jump instructions disrupt the regular flow of instructions
through the pipeline of modern processors, thus negatively
affecting their performance. In order to cope with this situation,
numerous schemes have been devised and implemented to reduce
that loss in performance [LeSm84] [Lilj88] [FaHe86] [DeRo87].
Figure 1 presents a model of a pipeline that we will use. In that
figure, the first instruction in the pipe, which was fetched from
address A, is a conditional branch to address B. Say that the
pipeline has k instruction-fetch stages, that branch instructions
compute (or know) their target address by the end of the (m+1)th
stage, and that the truth or falsity of the branch condition is known
by the end of the (n+1)th stage. In a naive architecture and
implementation, the n cycles after the branch instruction are lost
because it is not yet known which instruction to process (assuming
n>m).

In a slightly more clever implementation, these cycles are
lost only when the branch succeeds, by letting the instructions that
were fetched in the meanwhile from A+1 through A+n execute
and commit in case the branch condition fails. Unfortunately, the
more frequent case is for the branch to succeed, especially when
unconditional jumps, procedure calls, and returns are also
counted. The situation can be ameliorated using various hardware
or software methods. By feeding every fetch address A into a
special hardware branch-target buffer (BTB) (version #1), the
target address B may be generated in less than m+1 cycles, and
we can start processing instructions from that address right away.
If the BTB works in one pipe stage, there are no cycles lost whenever
we hit in the BTB and the prediction of the branch direction
that the BTB makes is correct. Another kind of branch-target
buffer (version #2) is a cache that contains the first few instructions
of basic blocks that are destinations of branches (or
jumps/calls/returns); a loop buffer is another similar piece of
hardware. If that cache can be accessed in less than k cycles,
then it can quickly supply the first instructions after a branch,
while the main instruction-fetch pipeline gets restarted. A third
hardware method of speeding up branches can be applied to pipelines
where the branch target address becomes known
(significantly) earlier than the branch condition (m<n, or presence
of BTB #1, or a special prepare-to-branch instruction placed
by the compiler a few slots before the branch itself). By fetching
instructions simultaneously from the two different possible
streams A+i and B+j, the hardware can later decide which one
of these streams to execute from.
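
The difference between the naive implementation and the slightly more clever one can be put in numbers. The sketch below assumes a fixed probability that a branch succeeds; that probability, the function name `lost_cycles_per_branch`, and the scheme labels are illustrative assumptions rather than values or terms from the paper.

```python
def lost_cycles_per_branch(n, p_taken, scheme):
    """Expected cycles lost per branch for the two simple schemes above.

    n        -- cycles between fetching the branch and knowing its outcome
    p_taken  -- assumed probability that the branch succeeds
    scheme   -- "naive": the n cycles are always lost;
                "fall_through": they are lost only when the branch succeeds
    """
    if not 0.0 <= p_taken <= 1.0:
        raise ValueError("p_taken must be a probability")
    if scheme == "naive":
        return float(n)
    if scheme == "fall_through":
        return p_taken * n
    raise ValueError(f"unknown scheme: {scheme}")

naive = lost_cycles_per_branch(3, 0.75, "naive")           # 3.0
better = lost_cycles_per_branch(3, 0.75, "fall_through")   # 2.25
```

Because branches succeed more often than they fail, the second scheme recovers only a minority of the lost cycles, which is what motivates the hardware and software methods discussed here.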

On the other hand, the software methods to reduce the cost
of branches are all based on letting the n delay-slot instructions
A+1 through A+n do useful work (delayed branches). One
scheme lets these instructions unconditionally commit, and
expects the compiler to move there useful instructions that can
safely be moved; these instructions may originate from before the
branch, if they do not affect it, or from one of the two targets, if
they are harmless should the branch go in the other direction.
Another scheme lets the delay-slot instructions commit only if the
branch goes in a prespecified direction; otherwise, they are
squashed. This lets the compiler move into the delay slots instructions
from one of the two targets, regardless of what their effect
would be should the branch go in the other direction. The success
Figure 2: Rearranging instructions around a conditional branch
In (a), instructions are conventionally arranged in an instruction memory. In (b), the width of the memory was doubled,
and the instructions were rearranged (instruction addresses increase from left to right and from top to bottom). The
branch instruction is followed by n=3 delay slots (single or double). The instructions in these slots that are marked
"f" are executed only if the branch fails, those marked "s" are executed only if the branch succeeds, while those
marked "fs" are executed in either case. Instruction b was moved from before the branch into the first delay slot and
will be executed unconditionally. Before fetching has had the time to revert to (y, z) (if the branch is taken), (e, w) and
(f, x) are automatically fetched and the CPU can selectively execute either e and f or w and x.
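
The selective execution described in the Figure 2 caption can be sketched as a choice between the two halves of each fetched pair. The function name `executed_in_delay_slots` and the representation of a double-width slot as a (fail, success) tuple are assumptions made for illustration.

```python
def executed_in_delay_slots(non_selective, selective_pairs, branch_taken):
    """Return the instructions executed in the delay slots of one branch.

    non_selective   -- instructions executed whatever the branch outcome
    selective_pairs -- (fail_instr, success_instr) pairs fetched together
                       from the double-width memory
    branch_taken    -- True if the branch succeeds
    """
    side = 1 if branch_taken else 0  # pick one instruction of each fetched pair
    return list(non_selective) + [pair[side] for pair in selective_pairs]

# Code of figure 2(b): b is unconditional, then the pairs (e, w) and (f, x).
on_success = executed_in_delay_slots(["b"], [("e", "w"), ("f", "x")], True)
on_failure = executed_in_delay_slots(["b"], [("e", "w"), ("f", "x")], False)
```

With these inputs the branch executes b, w, x when it succeeds and b, e, f when it fails, using one decode per cycle in both cases.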



the increasing size difference between the I-Cache and the register
file. The cycle time of the CPU is largely dictated by the cycle
time of the register file, a RAM of size one or a few kilobits,
made using high power and speed cells. On the other hand, the
instruction cache tends to contain one or more hundreds of kilobits,
by current and future standards. Given this large discrepancy
in size, the I-fetch latency is likely to grow to two or more pipeline
stages. For example, in the Prisma GaAs processor [Wils89],
instruction fetching required two pipeline stages (versus half a
stage for the register-file cycle-time).

Section 2 presents our scheme in more detail, explaining
how the delay slots can be filled in each case. In section 3 we
describe the post-processor which we implemented for rearranging
assembly code for our scheme, and for collecting measurements.
Section 4 presents our static and dynamic measurements
regarding code size and execution cycles. Section 5 compares
this scheme with other branch-cost reduction schemes, and
section 6 draws the conclusions. Section 2 is long because it
describes in detail all the various cases of code rearrangement that
may arise; readers who are less interested in these can skip over
its last part and go to section 3.



2. SEMANTICS AND USE OF THE DOUBLE-WIDTH 
DELAY SLOTS 

Several variations of the proposed scheme can be defined,
depending on the exact placement and the semantics of the delay
slots that follow the control-transfer instructions in the double-width
instruction memory. The number of choices is even greater
when this scheme is combined with a super-scalar architecture,
where the pairs of instructions may be intended sometimes for
sequential execution, other times for parallel execution, and other
times for alternative execution depending on the outcome of a
recent conditional branch.

In this paper, we assume a scalar, rather than super-scalar,
architecture, firstly because we want to study our scheme independent
of super-scalar issues, and secondly because the software that
we use for rearranging the instructions and collecting measurements
does not perform any dependency analysis, and thus cannot
decide whether or not two instructions can run in parallel.
Specifically, we assume that the CPU can decode and start executing
a single instruction per cycle, even though it fetches pairs of
instructions from the I-cache. Figure 3 illustrates the implication
of this assumption in a pipeline similar to that of figure 1 executing
the rearranged code of figure 2(b). The instructions that are
fetched during the delay slots can be distinguished in two
categories. For some of the initial ones (instruction b here), the
decoding and the beginning of execution occurs before the conditional
branch has decided, and thus we have no choice when starting
to execute them. For the rest of the delay slots, decoding only
happens after the branch has been resolved; since two instructions
are fetched per cycle, we can selectively decode and execute one
of them in each cycle, depending on the branch decision; we will
call these selective delay slots. Non-selective delay slots are like
conventional delay slots: the processor executes something
independent of the branch direction. These will typically be filled
from the basic block before the branch instruction, while selective
delay slots are filled from the two basic blocks that follow the
branch. If there are k fetch-stages, and if a branch is resolved at
the end of the (n+1)th stage, then there are n delay slots, the last
k of which are selective. Note that if an implementation has additional
hardware in the first stages of its execution pipeline, so that
it can start processing two instructions while later finishing up
Figure 4: Delay slots of unconditional transfers

In this example, the pipeline has n=3 delay slots. The target of the jump instruction in (a) is known at compile time, so
[...]; the target of the return in (c) is not known, so the rearrangement results in
[...]. The jump in (b) specifies that it uses all 3 delay slots, while the return in (d) says that only 2 of the slots that follow
it [...] instructions. [...] execution, but at least no code space is



delay slots; no cycles are lost. This is a method to achieve
object-code compatibility between implementations that use the
same instruction set but have different n: the compiler would
then try to optimize for the pipeline with the maximum expected
number of delay slots, and the code will run with no wasted cycles
on machines with smaller n.

2.1 Handling Single Jumps or Branches
Figure 4 shows how we use the aforementioned count of actual
delay slots for unconditional jumps. First the jump instruction is
moved k places up in its basic block (BB); if the BB is small, as
in (a), then instructions from the target basic block are brought to
fill the rest of the delay slots (part (b) of the figure); these instructions
may be removed from their original place if they cannot be
reached through any other control path. However, not all jumps
have a target that is known at compile time. In (c), the return
instruction, which is a register-indirect jump in RISC architectures,
was moved as far up as possible in its BB (see (d)), but the
rest of its delay slots could not be filled; thus, a count of 2 rather
than 3 slots is specified in that instruction. (A more sophisticated
optimizer could move the return farther up by copying and merging
its BB with the BB's that lead into it, but we did not do that in
our measurements.) The indexed jumps resulting from switch
(case) statements are an even harder case, because neither is their
target known nor can they be moved up in their own BB, due to
address computation dependencies.

Next, let us look carefully at the delay slots of conditional
branches. Figure 5(a) shows again the rearranged code of figures
2(b) and 3. The branch instruction specifies the count of
delay slots that follow the branch. We assume that there is an
agreement between the compiler and the implementation as to
how many of these slots are non-selective, while the rest of them
are selective; in figure 5 there is always one non-selective delay
slot after every branch, and in part (a) of the figure instruction b
could be moved there. For the selective delay slots, the agreement
is that the instruction "on the left" is to be executed if the branch
fails, while the one "on the right" is for the case of successful
branch. Thus, no f, s, or fs marks are actually necessary.
Remember that we assumed that only one instruction per cycle is
decoded, even though two instructions are fetched; thus, the selection
between instructions e and w and between f and x in the
two remaining (selective) slots cannot be based on their content,
i.e. on their f or s marks; it has to be based merely on their
positioning in the memory. The "right-hand side" selective delay
Figure 5: Selective delay slots after a conditional branch

The code of figure 2 is rearranged for up to n=3 delay slots, the first one of which is non-selective. Parts (a) and (b)
[...] for the two outcomes of the branch. In (c), an instruction from the
destination basic block had to be moved into the non-selective slot; it must be squashed if the branch fails. In (d), fewer
instructions could be moved into the selective slots; to reflect that, the branch instruction specifies a count of only 2
even more if more branches are present. When the width of the
instruction memory is two words, this situation cannot be handled.
We then specify that the first branch will have less than the maximum
number of delay slots, or we insert some noop(s).

Let us first consider the case when a branch is closely followed
by another branch on its success path (on the "right
side"), as in figure 6(a). In this figure we assume that branches
have n=4 delay slots, consisting of 1 non-selective and k=3 selective
slots. We also assume that instructions a and p, which precede
the branches, can be moved into the non-selective delay slots
after them. For legibility, the right-side selective delay slots of all
branches have been marked with a double line (the real implementation
does not contain any such marking). In each part of
this figure, original code is shown on the left and rearranged code
is shown on the right. In part (a), the second branch and its
non-selective delay slot (p) require a single I-fetch stream, and thus
can be moved into the delay slots of the first branch. However,
the selective delay slots of the second branch (q/w, r/x, s/y) need
two I-fetch streams and thus cannot be moved there. Since only 3
of the right-side delay slots of the first branch could be filled, its
slot count was reduced from 4 to 3; as a consequence, one execution
cycle will be lost whenever that first branch succeeds. In
order not to lose that cycle, we considered moving instructions
from one of the two streams that follow the second branch into
the selective slots of the first branch. This would be possible, but
(i) it would further complicate the semantics of the delay slots,
and (ii) it would require an arrangement of the instructions in the
selective slots of the second branch different from the one
required when control enters that block through another path (path
labeled in figure 6(a)). Since our measurements indicated
that for k<3 we lose at most 1% of the total execution cycles due
to such reductions of the delay slots of branches, we decided to
stay with the simpler scheme. Another observation is that, during
the delay slot cycle which is lost for the first branch, the second
branch does useful work, and thus it will need one less delay slot
for itself; again, in the interest of simplicity of semantics, we do
not take advantage of that.

Part (b) of figure 6 shows what we do when a branch is closely
followed by another branch on its failure path (on the "left side").
Again, we cannot move more than the second branch and its
non-selective slot (p) into the selective slots of the first branch.
This time, however, we choose to fill the extra delay slot with a noop
rather than reducing the number of delay slots. In this way, the noop
cycle will be wasted whenever the first branch fails, but no cycle will
be lost when that branch succeeds. If we chose to reduce the number of
delay slots, then no cycle would be wasted on failure, but a cycle
would be lost on success; given that branches succeed more frequently
than they fail, we opted for the noop solution.
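
The tradeoff between the two options can be sketched numerically. The
following is a minimal illustration, not part of the measurements: the
function names and the success probability passed in are hypothetical,
and each wasted slot is assumed to cost exactly one cycle.

```python
def expected_lost_cycles(p_success: float) -> dict:
    """Expected cycles lost per execution of the first branch, per option.

    p_success is an assumed probability that the first branch succeeds.
    """
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be a probability in [0, 1]")
    p_fail = 1.0 - p_success
    return {
        # Extra left-side slot holds a noop: wasted only when the branch fails.
        "noop": p_fail * 1.0,
        # Slot count reduced by one: a cycle is lost only when the branch succeeds.
        "reduce_slots": p_success * 1.0,
    }


def cheaper_option(p_success: float) -> str:
    """Name of the option with the smaller expected loss."""
    costs = expected_lost_cycles(p_success)
    return min(costs, key=costs.get)


if __name__ == "__main__":
    # 0.7 is an illustrative value only, not a measured success rate.
    print(expected_lost_cycles(0.7), cheaper_option(0.7))
```

Whenever the assumed success probability exceeds one half, the noop
option has the lower expected loss, which matches the choice described
above.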

2.3 Other Cases Needing Attention

Figure 7 illustrates some other fine points that arise when a
control-transfer instruction is so close to another that it is moved
into its delay slots. In this figure, n=k, i.e. branches only have
selective delay slots. Besides the branch-branch case, which was
illustrated in figure 6, figure 7(a) examines the jump-branch case.
Here, the branch, together with as many of its (selective or
non-selective) delay slots as fit, is moved into the jump's slots.
Notice that the alignment of the r/x and s/y delay slots of the branch
that is shown in figure 7(a) causes a performance loss problem. When
the jump transfers the fetch process to instruction x, x is fetched
together with q rather than together with r; thus, if the branch fails,
one cycle will be lost until r is fetched together with y. That loss is
avoided when the (q/w, r/x, s/y) alignment is different; if no
sequential path enters that block, we can always force the alignment
which is favorable to us. The same performance loss problem is present
for branch-branch pairs, as in figure 6(a); in that
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Figure 7: Other close combinations of control-transfer instructions

In this figure n=k=3. When a branch is moved into the delay slots of a jump, as in (a), the selective delay slots of the
former get spread in two regions. When a procedure call is moved into the delay slots of a branch, as in (b), the return
address that it saves (pointer to q) must be generated according to dynamic pipeline information rather than static
instruction addresses.
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We have more confidence in our measurements from the first
compiler, since it fills more delay slots than the GNU compiler
does, and since the MIPS compilers postpone such filling till after
our postprocessor. The above compilers produced assembly code
for the following four benchmark programs:

* the cc1 part of GCC - GNU C compiler for 68020, version
1.35. The cc1 part is the core of GCC and does the bulk of the
work; it contains 83,700 lines of code; its object code (SUN cc)
contains 129,129 instructions. In our dynamic measurements,
cc1 itself was fed to gcc -O -S; 2.6 billion instructions were
executed.

* troff, Jan. 86 version, from the ditroff package; it contains
8,530 lines of code; its object code (SUN cc) contains 14,731
instructions. In the dynamic measurements, a 2-MByte text file
was run through troff; 1.1 billion instructions were executed.

* CommonTeX - a C-language version of the TeX typesetting
program; it contains 20,200 lines of code; its object code (SUN
cc) contains 45,145 instructions.

* SPICE version 2g6, converted from FORTRAN into C by an
automatic translation program. SPICE performs lots of
numeric computations in order to simulate the analog behavior
of electronic circuits. It contains 33,<K?G lines of code; its 
object code (SUN cc) contains 107,328 instructions. 



4. CODE SIZE AND BRANCH CYCLES IN THE
SCHEME

Figure 8 shows the static measurements that we collected using
our postprocessor, as described in the previous section, in order to
evaluate the code density under the new scheme. The numbers
shown are for code generated by the SUN-4 C compiler. In these
measurements, wherever we show percentages in or over the
"original code", we mean the code generated by the compiler and
adjusted by us for an architecture with no delayed branches and
no noop's.

Figure 8(a) shows the static percentage of control-transfer
instructions (CTI's) in the original code, for the four benchmarks,
broken down in conditional branches (labeled b), unconditional
jumps to known targets (j), procedure calls (c), returns (ranging
between 0.1 and 1.7 %), and indexed jumps resulting from switch
statements (ranging between 0.05 and [...] %). Cc1, troff and TeX
have considerably more CTI's than spice, which is a
numerical-computation program. The average basic block size ranges
from 4 or 4.3 instructions for the non-numeric programs to 9
instructions for spice. In the MIPS code, the percentages of CTI's are
in general higher - each of the branch, jump, and call categories has
about an additional 1%. MIPS basic blocks are about half an
instruction shorter. These effects are mainly due to the existence
of combined compare-and-branch instructions in MIPS. The
GNU compiler produces fewer CTI's than the SUN/DEC compilers.

Figure 8(b) shows the value of the "replication factor".
When instructions from the target of a jump or branch are moved
into its delay slots, they have to be replicated when another path
can still reach them at their original place; otherwise, they can
simply be "moved" (figure 4(b), figure 5). The replication factor
shows how often instructions have to be replicated, relative to the
total number of such instruction movements. When a basic block
can be reached through sequential execution, every branch or
jump into it needs to replicate all instructions that it takes; when a
basic block can be reached only through branches or jumps, then
all but one of these CTI's need to replicate. The interesting
conclusion is that this replication factor varies within a relatively
narrow range around 0.77.
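
The replication rule just stated can be written down as a small sketch.
This is an illustration under simplifying assumptions, not the
postprocessor's code: the function names and the block descriptions
are hypothetical, and every CTI is assumed to take the same number of
instructions from its target, so that counting CTI's stands in for
counting instruction movements.

```python
def replicating_ctis(num_entering_ctis: int, reachable_sequentially: bool) -> int:
    """How many CTI's entering a basic block must replicate what they take."""
    if num_entering_ctis < 0:
        raise ValueError("num_entering_ctis must be non-negative")
    if reachable_sequentially:
        # The original copy must stay in place for the fall-through path,
        # so every entering branch or jump replicates.
        return num_entering_ctis
    # Only CTI's enter the block: one of them may simply "move" the
    # instructions, all the others replicate.
    return max(num_entering_ctis - 1, 0)


def replication_factor(blocks) -> float:
    """blocks: iterable of (num_entering_ctis, reachable_sequentially) pairs."""
    blocks = list(blocks)
    total = sum(n for n, _ in blocks)
    replicated = sum(replicating_ctis(n, seq) for n, seq in blocks)
    return replicated / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical program: three target blocks.
    example = [(2, True), (3, False), (1, False)]
    print(replication_factor(example))
```

In the hypothetical example, 4 of the 6 movements are replications; the
blocks with a single entering CTI and no sequential entry are the ones
that pull the factor below 1.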

At this point we should note that the replication factor can
be reduced with some help from the compiler. For example, the
SUN-4 compiler, which we used for these measurements, compiles
while-statements into: "if (not expr) branch to exit; loop:
(body; if (expr) branch to loop); exit:". That leads to
replication of the first few instructions of the body after the
second branch. If the while were compiled: "jump to test; loop:
(body; test: if (expr) branch to loop); exit:", then no such
replication would be needed.

Figure 8(c) shows the degree by which branch and jump
delay slots are filled with useful instructions, as a function of k,
the number of slots per CTI. These numbers refer to the right-side
selective delay slots of conditional branches, and the delay slots of
unconditional jumps to known targets (calls are not included -
their delay slots can always be filled for the values of k that we
considered). The jump delay slots are almost always filled, either
with instructions from the basic block before the jump, or with
instructions from its target. The branch slots can always be filled
for k=1 and almost always for k=2; for k=3, 94% of them can be
filled (the reasons why the rest of them cannot were illustrated in
figures 5(d) and 6(a)). The product of branch instruction
frequency in the original code, times k (the number of selective
delay slots per branch), times the branch fullness, times the
replication factor gives the amount of code size increase that is due to
the selective slots of branches.
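
The product described above can be sketched as a one-line computation.
This is only an illustration: the function name is hypothetical, and
the static branch frequency used in the example (12%) is an assumed
value, not one of the measurements; the fullness (94% for k=3) and the
replication factor (0.77) are the figures quoted in the text.

```python
def branch_slot_code_growth(branch_freq: float, k: int,
                            fullness: float, replication: float) -> float:
    """Code size increase due to selective branch slots, as a fraction
    of the original code size.

    branch_freq : static fraction of instructions that are conditional branches
    k           : number of selective delay slots per branch
    fullness    : fraction of those slots filled with useful instructions
    replication : replication factor (fraction of moved instructions copied)
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    for name, value in (("branch_freq", branch_freq),
                        ("fullness", fullness),
                        ("replication", replication)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    return branch_freq * k * fullness * replication


if __name__ == "__main__":
    # branch_freq=0.12 is an assumed, illustrative value.
    growth = branch_slot_code_growth(0.12, 3, 0.94, 0.77)
    print(f"{growth:.1%}")
```

Because the four terms simply multiply, the growth scales linearly in
k for a fixed fullness, so the fall in fullness at k=3 slightly damps
the increase.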

Figure 8(d) shows the overall increase of code size over the
size of the original code, for each of the four benchmarks
compiled by the SUN-4 compiler. Each number that is shown as a
vertical bar with solid lines is broken down into its three
contributions: selective delay slots of branches (labeled b), delay
slots of unconditional jumps to known targets (j), and delay slots of
procedure calls (c). (The delay slots of indexed jumps to case's are
not filled, and those of return's are only filled with instructions
from before them, so these two categories do not contribute to
increasing the code size). The solid lines correspond to the cases
where branches only have selective delay slots: n=k.
Measurements are given for k = 1, 2, or 3 fetch pipeline stages. When
n=k+1, i.e. when branches are resolved in the second execution
stage and hence they have one non-selective delay slot, the
contribution to increased code size by the above factors was measured
to be almost identical - usually it increased by about 0.1%, at
most by 0.4%, and in one case it decreased by 1% (remember that
when n=k+1, unconditional CTI's are assumed to still have k
delay slots). What changes when n=k+1 is the presence of the
non-selective delay slots of branches, for which our postprocessor
does whatever the SUN-4 compiler did. That compiler manages
to fill only about 20% of them with instructions from before the
branch, about 73% of them with instructions from the branch
target (to be squashed (annulled) if the branch fails), and fills the
remaining 7% with noop's. These noop's, as well as the
instructions from the target times the replication factor, increase
our code size in the same way as they do for the [...]. The total code
size increase when n=k+1 is shown in figure 8(d) with dashed
lines. Overall, we see that the code size may increase by as little
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Figure 9: Dynamic measurements 
Part (a) shows the percentage of the various CTI's that are executed in the original code, for the two benchmarks. Part
(b) shows the additional execution cycles, over the total original cycles for executing the entire benchmark, when there
are k fetch stages and when conditional branches have n delay slots. The additional cycles are broken down according
to their main cause: non-selective branch slots, selective branch slots, delay slots of returns, delay slots of indexed
jumps to case's.



in this figure because their contribution to lost cycles is negligible:
at most 0.001. Procedure calls do not appear either because
their delay slots are always filled, and thus no cycles are ever lost
because of them. If, however, the linker did not copy instructions
from the head of the called procedure into the delay slots of calls,
then a major contribution to the lost cycles would be added: 0.3%
(k=1), or 1.3% (k=2), or 2.5% on the average.

The dashed lines in figure 9(b) are for the case n=k+1,
when conditional branches have one more, non-selective delay
slot. When that slot is filled with an instruction from the branch
target and the branch fails, or when it is filled with a noop, then a
cycle is lost; these cycles account for a major portion of the
wasted cycles (about 4 % of the original cycles) when n=k+1.
This number is exactly the same in our architecture as it would be
in any conventional RISC with one branch delay slot; any
compiler that can improve the utilization of that delay slot on a
conventional RISC will also have the same effect on the scheme of
this paper. In figure 9(b), the number of lost cycles is for SPARC
code. Our corresponding measurements for MIPS code gave the
same percentage of lost cycles for cc1, while troff gave a better
(lower) percentage by about 1% for k=3.
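
The rule for when the non-selective slot loses a cycle can be sketched
as follows. This is a rough illustration and not the method used for
figure 9: the function names are hypothetical, the dynamic branch
frequency and failure probability are assumed inputs, the 73% and 7%
defaults are the static fill fractions quoted earlier for the SUN-4
compiler (used here as if they also held dynamically), and the original
code is assumed to take one cycle per instruction.

```python
def lost_cycles_per_branch(p_fail: float,
                           frac_from_target: float = 0.73,
                           frac_noop: float = 0.07) -> float:
    """Expected cycles lost in the non-selective slot per executed branch.

    A slot filled from the branch target is wasted only when the branch
    fails; a slot filled with a noop is always wasted; a slot filled from
    before the branch is never wasted.
    """
    for value in (p_fail, frac_from_target, frac_noop):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all arguments must lie in [0, 1]")
    return frac_from_target * p_fail + frac_noop


def lost_cycle_fraction(dyn_branch_freq: float, p_fail: float) -> float:
    """Lost cycles as a fraction of the original execution cycles."""
    return dyn_branch_freq * lost_cycles_per_branch(p_fail)


if __name__ == "__main__":
    # Both inputs are assumed, illustrative values, not measurements.
    print(f"{lost_cycle_fraction(0.12, 0.35):.2%}")
```

The sketch makes the dependence explicit: a compiler that fills more of
these slots from before the branch lowers both fractions and hence the
loss, on a conventional RISC and on this scheme alike.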

The central point of this paper are the selective delay slots.
Thus, it is worth looking more carefully at their contribution to
wasted cycles; table 1 does that. This table contains fractions
over the number of executed branches rather than the number of
all executed instructions of figure 9. Thus, its numbers can be
interpreted as "cycles per branch". The contribution of
non-selective delay slots is ignored. The four contributions that may
waste selective delay slot cycles are:



(i) case label inside branch delay slots (fig. 5(d));
(ii) branch followed by branch on the failure side (fig. 6(b));
(iii) branch followed by branch on the success side (fig. 6(a));
(iv) misalignment effect - see figure 7(a).
Table 1 shows the contribution of these factors.



Table 1: Cycles per conditional branch (n=k)

     k                     Contributing factors
(fetch st.)   br. itself    (i)      (ii)     (iii)     (iv)     Total
     1           1.00      0.0007   0.0000   0.0000   0.0003    1.001
     2           1.00      0.0014   0.0001   0.0215   0.0139    1.037
     3           1.00      0.0022   0.0005   0.0880   0.0276    1.118

We see that the scheme of this paper dramatically
decreases the number of cycles per conditional branch: even in the
case k=3, when a dummy pipeline would take 4.0 cycles per
branch, our scheme only takes 1.1 cycles per branch. The major
contribution to wasted cycles is from branches closely following
other branches on the success path.
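
The totals of table 1 can be checked by adding the branch's own cycle
to the four contributions. The sketch below is illustrative only: the
names are hypothetical, the values are transcribed from table 1 (the
k=3 entry for factor (iii) is read as 0.0880, an assumption that agrees
with the printed total of 1.118), and the baseline of 1+k cycles per
branch for a pipeline that simply waits is inferred from the 4.0 cycles
quoted for k=3.

```python
# Contributions (i)..(iv) in cycles per conditional branch, indexed by k.
TABLE_1 = {
    1: (0.0007, 0.0000, 0.0000, 0.0003),
    2: (0.0014, 0.0001, 0.0215, 0.0139),
    3: (0.0022, 0.0005, 0.0880, 0.0276),  # (iii) read as 0.0880: assumption
}
FACTOR_NAMES = ("(i)", "(ii)", "(iii)", "(iv)")


def cycles_per_branch(k: int) -> float:
    """One cycle for the branch itself plus the wasted selective-slot cycles."""
    return 1.0 + sum(TABLE_1[k])


def baseline_cycles_per_branch(k: int) -> float:
    """Assumed cost when all k fetch stages are lost on every branch."""
    return 1.0 + k


def dominant_factor(k: int) -> str:
    """Name of the factor that wastes the most cycles for this k."""
    values = TABLE_1[k]
    return FACTOR_NAMES[values.index(max(values))]


if __name__ == "__main__":
    for k in sorted(TABLE_1):
        print(k, round(cycles_per_branch(k), 3),
              baseline_cycles_per_branch(k), dominant_factor(k))
```

For k=2 and k=3 the dominant factor is (iii), the branch followed by a
branch on the success side, in line with the observation above.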
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This scheme requires somewhat more code size expansion (e.g. 25
to 45 %), while it can reduce the cycles per branch down to 1.05
to 1.5 for two to four delay slots. Since the code size does not
differ much, the main cost for the clearly better performance of
the new scheme is the double-width I-fetch bus. Which way the
tradeoff leans depends on the rest of the architecture and
implementation. On-chip instruction RAM's can usually supply the
additional bandwidth at small cost. Superscalar architectures
already have the additional bandwidth; for branches with little
parallelism in their predicted target block, this scheme offers a
good alternative for filling their delay slots.
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